I’m currently building a dataset using Hugging Face Datasets. Each image in my dataset has approximately 15 different annotations. My dataset is structured such that each data item corresponds to exactly one annotation (i.e., the i-th annotation can be accessed directly with `dataset[i]`). Thus, if I have one image with 15 annotations, it results in a dataset of 15 items, each containing the same image paired with a different annotation.
This approach wastes storage because the same image is stored once per annotation, i.e. roughly 15 times.
I prefer to avoid external URLs or external storage solutions and would like to:
- Store each image exactly once within an Arrow file or similar bundled format.
- Store annotations with internal references to these images.
- Dynamically load and pair the image from this internal reference with the corresponding annotation upon accessing `dataset[i]`.
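To make the idea concrete, here is a minimal pure-Python sketch of the layout I have in mind. The names `PairedDataset`, `image_id`, and `caption` are hypothetical and chosen only for illustration; this is not a Hugging Face Datasets API. I assume the same wrapper could sit on top of two `datasets.Dataset` objects (one for images, one for annotations), since those also support `len()` and integer indexing, but I don't know whether that is the recommended way to do it.

```python
class PairedDataset:
    """Hypothetical wrapper: pairs each annotation row with its image on access.

    `images` and `annotations` are any indexable tables whose rows are dicts
    (plain lists here; two datasets.Dataset objects are an assumed alternative).
    """

    def __init__(self, images, image_ids, annotations, key="image_id"):
        self.images = images
        self.annotations = annotations
        self.key = key
        # Map each image id to its row position so a lookup is a dict access.
        self.row_of = {image_id: row for row, image_id in enumerate(image_ids)}

    def __len__(self):
        # One item per annotation, matching the current dataset[i] behaviour.
        return len(self.annotations)

    def __getitem__(self, i):
        # Copy the annotation row, then attach the image it refers to.
        item = dict(self.annotations[i])
        image_row = self.row_of[item[self.key]]
        item["image"] = self.images[image_row]["image"]
        return item


# Each image is stored exactly once (placeholder bytes stand in for real images).
images = [
    {"image_id": "img0", "image": b"bytes-of-image-0"},
    {"image_id": "img1", "image": b"bytes-of-image-1"},
]
# Annotations carry only a reference to the image, not the image itself.
annotations = [
    {"image_id": "img0", "caption": "a cat"},
    {"image_id": "img0", "caption": "a sofa"},
    {"image_id": "img1", "caption": "a dog"},
]

paired = PairedDataset(images, [row["image_id"] for row in images], annotations)
print(len(paired), paired[1]["caption"])
```

Here `paired[i]` returns the i-th annotation together with its image, while the image table holds one row per image rather than one per annotation.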
I have this constraint because I want to know whether this can be achieved without altering my existing framework. I understand that using external URLs or storage would simplify the problem significantly, but I’d like to avoid that if possible.
Does Hugging Face Datasets support this internal referencing mechanism to efficiently manage images without redundant duplication or external downloads? If so, could anyone provide guidance or examples on how to implement this approach effectively?